    Q-learning with Nearest Neighbors

    We consider model-free reinforcement learning for infinite-horizon discounted Markov Decision Processes (MDPs) with a continuous state space and unknown transition kernel, when only a single sample path of the system under an arbitrary policy is available. We consider the Nearest Neighbor Q-Learning (NNQL) algorithm, which learns the optimal Q-function using a nearest neighbor regression method. As the main contribution, we provide a tight finite-sample analysis of the convergence rate. In particular, for MDPs with a $d$-dimensional state space and discount factor $\gamma \in (0,1)$, given an arbitrary sample path with "covering time" $L$, we establish that the algorithm is guaranteed to output an $\varepsilon$-accurate estimate of the optimal Q-function using $\tilde{O}\big(L/(\varepsilon^3(1-\gamma)^7)\big)$ samples. For instance, for a well-behaved MDP, the covering time of the sample path under the purely random policy scales as $\tilde{O}\big(1/\varepsilon^d\big)$, so the sample complexity scales as $\tilde{O}\big(1/\varepsilon^{d+3}\big)$. Indeed, we establish a lower bound showing that a dependence of $\tilde{\Omega}\big(1/\varepsilon^{d+2}\big)$ is necessary. Comment: Accepted to NIPS 2018.
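
    As a rough illustration of the idea, the following is a minimal sketch of nearest-neighbor Q-learning on a one-dimensional state space: Q-values are kept at a fixed grid of anchor states, and each observed transition updates the nearest anchor. The `step` interface, the uniform anchor grid, and the stepsize schedule are illustrative assumptions, not the paper's exact construction (which processes the single sample path in covering-time epochs).

```python
import numpy as np

# Minimal sketch of nearest-neighbor Q-learning (NNQL) on the state
# space [0, 1]. The anchor grid and decaying stepsize are illustrative
# assumptions; `step(s, a)` is a user-supplied environment returning
# (reward, next_state) from a single sample path.

def nnql(step, n_actions, n_anchors=50, gamma=0.9, n_steps=100_000):
    anchors = np.linspace(0.0, 1.0, n_anchors)   # anchor states
    Q = np.zeros((n_anchors, n_actions))         # Q estimates at anchors
    counts = np.zeros((n_anchors, n_actions))    # visit counts for stepsizes

    s = 0.5                                      # arbitrary initial state
    for _ in range(n_steps):
        a = np.random.randint(n_actions)         # arbitrary behavior policy
        r, s_next = step(s, a)                   # observe one transition
        i = np.abs(anchors - s).argmin()         # nearest anchor to s
        j = np.abs(anchors - s_next).argmin()    # nearest anchor to s'
        counts[i, a] += 1
        alpha = 1.0 / counts[i, a]               # decaying stepsize
        target = r + gamma * Q[j].max()          # NN estimate of max_a' Q(s', a')
        Q[i, a] += alpha * (target - Q[i, a])    # Q-learning update at anchor i
        s = s_next
    return anchors, Q
```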

    Greed Works -- Online Algorithms For Unrelated Machine Stochastic Scheduling

    This paper establishes performance guarantees for online algorithms that schedule stochastic, nonpreemptive jobs on unrelated machines to minimize the expected total weighted completion time. Prior work on unrelated machine scheduling with stochastic jobs was restricted to the offline case, and required linear or convex programming relaxations for the assignment of jobs to machines. The algorithms introduced in this paper are purely combinatorial. The performance bounds are of the same order of magnitude as those of earlier work, and depend linearly on an upper bound on the squared coefficient of variation of the jobs' processing times. Specifically, for deterministic processing times, without and with release times, the competitive ratios are 4 and 7.216, respectively. As for the technical contribution, the paper shows how dual fitting techniques can be used for stochastic and nonpreemptive scheduling problems. Comment: Preliminary version appeared in IPCO 2017.
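
    A minimal sketch of such a combinatorial greedy rule, under simplifying assumptions: each machine sequences its jobs in WSPT order (weighted shortest processing time first), and an arriving job is sent to the machine where it causes the smallest increase in total weighted completion time. Deterministic processing times are used for brevity; the stochastic setting would use expected processing times plus a term involving the squared coefficient of variation. The `Machine` container and `marginal_cost` helper are hypothetical names, not the paper's notation.

```python
from dataclasses import dataclass, field

# Sketch of greedy list scheduling: assign each arriving job to the
# machine where its marginal increase in total weighted completion
# time is smallest, with each machine sequencing jobs in WSPT order.

@dataclass
class Machine:
    jobs: list = field(default_factory=list)  # (weight, proc_time) pairs

def marginal_cost(machine, w, p):
    """Increase in total weighted completion time if job (w, p) joins."""
    # Processing volume of jobs with higher WSPT priority (they finish first).
    done_before = sum(q for v, q in machine.jobs if v / q >= w / p)
    # Total weight of lower-priority jobs that the new job delays by p.
    weight_after = sum(v for v, q in machine.jobs if v / q < w / p)
    return w * (done_before + p) + weight_after * p

def greedy_assign(machines, w, p):
    best = min(machines, key=lambda m: marginal_cost(m, w, p))
    best.jobs.append((w, p))
    return best

# Example. For truly unrelated machines, p would depend on the machine;
# it is machine-independent here purely for brevity.
machines = [Machine() for _ in range(3)]
for w, p in [(3, 2.0), (1, 4.0), (2, 1.0)]:
    greedy_assign(machines, w, p)
```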

    Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

    We develop several provably efficient model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov Decision Processes (MDPs). We consider both the online setting and the setting with access to a simulator. In the online setting, we propose model-free RL algorithms based on reference-advantage decomposition. Our algorithm achieves $\widetilde{O}(S^5A^2\mathrm{sp}(h^*)\sqrt{T})$ regret after $T$ steps, where $S \times A$ is the size of the state-action space and $\mathrm{sp}(h^*)$ is the span of the optimal bias function. Our results are the first to achieve optimal dependence on $T$ for weakly communicating MDPs. In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\,\mathrm{sp}^2(h^*)}{\epsilon^2}+\frac{S^2A\,\mathrm{sp}(h^*)}{\epsilon}\right)$ samples, whereas the minimax lower bound is $\Omega\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$. Our results are based on two new techniques that are unique to the average-reward setting: 1) better discounted approximation by value-difference estimation; and 2) efficient construction of a confidence region for the optimal bias function with space complexity $O(SA)$.
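
    To make the discounted-approximation idea concrete, here is a hedged sketch of the simulator-setting route: solve a discounted MDP whose discount factor is pushed toward 1 at a rate tied to $\mathrm{sp}(h^*)$ and $\epsilon$, then act greedily with respect to the resulting Q-function. Plain Monte Carlo value iteration stands in for the paper's refined value-difference estimators, and the specific discount schedule below is an assumption for illustration only.

```python
import numpy as np

# Sketch of the discounted-approximation route to average-reward RL
# with a generative model: choose gamma close to 1 as a function of
# sp(h*) and eps, run synchronous Monte Carlo value iteration, and
# return the greedy policy. The discount schedule and sweep/sample
# counts are illustrative assumptions, not the paper's algorithm.

def avg_reward_via_discounting(sample, S, A, span_h, eps, n_samples=1000):
    gamma = 1.0 - eps / (eps + span_h)     # assumed discount schedule
    Q = np.zeros((S, A))
    for _ in range(200):                   # value-iteration sweeps
        V = Q.max(axis=1)
        for s in range(S):
            for a in range(A):
                # Monte Carlo estimate of r(s, a) + gamma * E[V(s')].
                rs, s_next = zip(*(sample(s, a) for _ in range(n_samples)))
                Q[s, a] = np.mean(rs) + gamma * np.mean(V[list(s_next)])
    return Q.argmax(axis=1)                # greedy policy of the discounted MDP
```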

    Scheduling and resource allocation for clouds: novel algorithms, state space collapse and decay of tails

    Scheduling and resource allocation in cloud systems is of fundamental importance to system efficiency. The focus of this thesis is to study the fundamental limits of scheduling and resource allocation problems in clouds, and to design provably high-performance algorithms. In the first part, we consider data-centric scheduling. Data-intensive applications are posing increasingly significant challenges to scheduling in today's computing clusters. The presence of data induces an extremely heterogeneous cluster where processing speed depends on the task-server pair. The situation is further complicated by ever-changing technologies of networking, memory, and software architecture. As a result, a suboptimal scheduling algorithm causes unnecessary delay in job completion and wastes system capacity. We propose a versatile model featuring a multi-class parallel-server system that readily incorporates different characteristics of a variety of systems. The model has been studied by Harrison, Williams, and Stolyar. However, delay optimality in heavy traffic with unknown arrival rate vectors has remained an open problem. We propose novel algorithms that achieve delay optimality with unknown arrival rates, which enables the application of the proposed algorithms to data-centric clusters. New proof techniques are required, including the construction of an ideal load decomposition. To demonstrate the effectiveness of the proposed algorithms, we implement a Hadoop MapReduce scheduler and show that it achieves a more than tenfold improvement over existing schedulers.

    The second part studies the resource allocation problem for clouds that provide infrastructure as a service, in the form of virtual machines (VMs). Consolidation of multiple VMs on a single physical machine (PM) has been advocated for improving system utilization. VMs placed on the same PM are subject to a resource "packing constraint", leading to stochastic dynamic bin packing models for the real-time assignment of VMs to PMs in a data center. Because server pools are finite, incoming requests might not be fulfilled immediately, and such requests are typically rejected; hence a meaningful metric in practice is the blocking probability for arriving VM requests. We analyze the power-of-d-choices algorithm, a well-known stateless randomized routing policy with low scheduling overhead, as sketched below. We establish an explicit upper bound on the equilibrium blocking probability, and further demonstrate that the blocking probability exhibits distinct behaviors in different load regions: doubly exponential decay in the heavy-traffic regime and exponential decay in the critically loaded regime.
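
    A minimal sketch of the power-of-d-choices placement rule analyzed in the second part, assuming unit-size VMs and identical PMs for simplicity: sample d physical machines uniformly at random and place the VM on the least loaded feasible one, blocking the request if none of the d samples has spare capacity.

```python
import random

# Power-of-d-choices VM placement: probe d PMs chosen uniformly at
# random, place the VM on the least loaded one with spare capacity,
# and block the request if all d probes are full. Unit-size VMs and
# identical PMs are simplifying assumptions.

def place_vm(loads, capacity, d=2):
    """loads: per-PM counts of active VMs. Returns PM index or None (blocked)."""
    candidates = random.sample(range(len(loads)), d)
    feasible = [i for i in candidates if loads[i] < capacity]
    if not feasible:
        return None                                # request blocked
    i = min(feasible, key=lambda j: loads[j])      # least loaded probe
    loads[i] += 1
    return i

# Tiny usage example: 100 PMs of capacity 8, a burst of 500 arrivals.
loads = [0] * 100
blocked = sum(place_vm(loads, capacity=8) is None for _ in range(500))
```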

    RL-QN: A Reinforcement Learning Framework for Optimal Control of Queueing Systems

    With the rapid advance of information technology, network systems have become increasingly complex, and hence the underlying system dynamics are often unknown or difficult to characterize. Finding a good network control policy is of significant importance for achieving desirable network performance (e.g., high throughput or low delay). In this work, we consider using model-based reinforcement learning (RL) to learn the optimal control policy for queueing networks so that the average job delay (or, equivalently, the average queue backlog) is minimized. Traditional approaches in RL, however, cannot handle the unbounded state spaces of the network control problem. To overcome this difficulty, we propose a new algorithm, called Reinforcement Learning for Queueing Networks (RL-QN), which applies model-based RL methods over a finite subset of the state space while applying a known stabilizing policy for the rest of the states. We establish that the average queue backlog under RL-QN with an appropriately constructed subset can be arbitrarily close to the optimal result. We evaluate RL-QN on dynamic server allocation, routing, and switching problems. Simulation results show that RL-QN effectively minimizes the average queue backlog.
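
    The control structure of RL-QN can be sketched as follows, with hypothetical names: inside a finite truncated region of the queue-length state space (here, all queues at most a threshold), act according to a policy learned by any model-based RL method; outside it, fall back to a known stabilizing policy (longest-queue-first is used below purely as a placeholder).

```python
# Sketch of the RL-QN decision rule: a learned policy on a finite
# truncated region of the unbounded queue-length state space, and a
# known stabilizing fallback elsewhere. `rl_policy` is assumed to be
# produced by a model-based RL method fit on the truncated region;
# THRESHOLD and the longest-queue-first fallback are placeholders.

THRESHOLD = 20  # truncation level defining the finite subset

def rl_qn_action(queues, rl_policy):
    """queues: tuple of queue lengths; rl_policy: dict state -> action."""
    if max(queues) <= THRESHOLD:          # state inside the finite subset
        return rl_policy[queues]          # learned (near-optimal) action
    # Outside the subset: stabilizing fallback, e.g. serve the longest queue.
    return max(range(len(queues)), key=lambda i: queues[i])
```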

    Bias and Extrapolation in Markovian Linear Stochastic Approximation with Constant Stepsizes

    We consider Linear Stochastic Approximation (LSA) with a constant stepsize and Markovian data. Viewing the joint process of the data and the LSA iterate as a time-homogeneous Markov chain, we prove its convergence to a unique limiting and stationary distribution in Wasserstein distance and establish non-asymptotic, geometric convergence rates. Furthermore, we show that the bias vector of this limit admits an infinite series expansion with respect to the stepsize. Consequently, the bias is proportional to the stepsize up to higher-order terms. This result stands in contrast with LSA under i.i.d. data, for which the bias vanishes. In the reversible-chain setting, we provide a general characterization of the relationship between the bias and the mixing time of the Markovian data, establishing that they are roughly proportional to each other. While Polyak-Ruppert tail averaging reduces the variance of the LSA iterates, it does not affect the bias. The above characterization allows us to show that the bias can be reduced using Richardson-Romberg extrapolation with $m \ge 2$ stepsizes, which eliminates the $m-1$ leading terms in the bias expansion. This extrapolation scheme leads to an exponentially smaller bias and an improved mean squared error, both in theory and empirically. Our results apply immediately to the Temporal Difference (TD) learning algorithm with linear function approximation, Markovian data, and constant stepsizes.
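
    As a concrete instance, here is a sketch of Richardson-Romberg extrapolation with $m = 2$ stepsizes applied to constant-stepsize TD(0) with linear function approximation: run two chains with stepsizes $\alpha$ and $2\alpha$, tail-average each, and combine as $2\bar{\theta}_\alpha - \bar{\theta}_{2\alpha}$ so that the leading $O(\alpha)$ bias term cancels. The `next_transition` stream is an assumed stand-in for Markovian data yielding feature vectors and rewards.

```python
import numpy as np

# Richardson-Romberg extrapolation (m = 2) for constant-stepsize TD(0)
# with linear function approximation. `next_transition()` is an assumed
# interface returning (phi, r, phi_next) from a single Markovian stream;
# step counts and burn-in are illustrative.

def td_tail_average(next_transition, dim, alpha, gamma=0.99,
                    n_steps=200_000, burn_in=100_000):
    theta = np.zeros(dim)
    avg = np.zeros(dim)
    for t in range(n_steps):
        phi, r, phi_next = next_transition()
        td_error = r + gamma * phi_next @ theta - phi @ theta
        theta += alpha * td_error * phi            # constant-stepsize TD(0)
        if t >= burn_in:                           # Polyak-Ruppert tail average
            avg += (theta - avg) / (t - burn_in + 1)
    return avg

def rr_extrapolate(next_transition, dim, alpha):
    th_a = td_tail_average(next_transition, dim, alpha)
    th_2a = td_tail_average(next_transition, dim, 2 * alpha)
    return 2.0 * th_a - th_2a   # cancels the O(alpha) leading bias term
```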